Distributional RL – Simple Machine Learning


1 Q-learning

In Reinforcement Learning we are interested in maximizing the Expected Return, so we usually work directly with those expectations. For instance, in Q-learning with function approximation we want to minimize the error \[\begin{align*}\mathbb{E}_{s,a}\left[\left(r(s,a)+\gamma\mathbb{E}_{s'}\left[\max_{a'}Q(s',a')\right]-Q(s,a)\right)^{2}\right],\end{align*}\] or, equivalently (as far as the semi-gradient update is concerned), \[\begin{align} \mathbb{E}_{s,a,s'}\left[\left(r(s,a)+\gamma\max_{a'}Q(s',a')-Q(s,a)\right)^{2}\right],\label{eq:classic_td_error} \end{align}\] where \(r(s,a)\) is the expected immediate reward. In semi-gradient methods we do this by moving \(Q(s,a)\) towards the target \(r(s,a)+\gamma\max_{a'}Q(s',a')\), pretending that the target is constant; in DQN (Mnih et al. 2015) the target is even computed with a frozen "target network" to improve stability further.
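To make the semi-gradient idea concrete, here is a minimal sketch (not the actual DQN code) of this loss in PyTorch; the names `q_net`, `target_net` and the batch tensors are placeholders, and `torch.no_grad()` plays the role of "pretending the target is constant":

```python
# A minimal sketch of the semi-gradient update: the target
# r + gamma * max_a' Q_target(s', a') is treated as a constant,
# so gradients only flow through Q(s, a).
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():                                   # frozen target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```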

The main idea of Distributional RL (M. G. Bellemare, Dabney, and Munos 2017) is to work directly with the full distribution of the return rather than with its expectation. Let the random variable \(Z(s,a)\) be the return obtained by starting from state \(s\), performing action \(a\) and then following the current policy. Then \[\begin{align*}Q(s,a)=\mathbb{E}[Z(s,a)].\end{align*}\] Instead of minimizing the error \(\ref{eq:classic_td_error}\), which is basically a distance between expectations, we can try to minimize a distributional error, which is a distance between full distributions: \[\begin{align} \sup_{s,a}\mathrm{dist}\left(R(s,a)+\gamma Z(s',a^{*}),Z(s,a)\right)\label{eq:distrib_td_error}\\ s'\sim p(\cdot|s,a)\nonumber \end{align}\] where you can mentally replace \(\sup\) with \(\max\), \(R(s,a)\) is the random variable for the immediate reward, and \[\begin{align*}a^{*}=\underset{a'}{\mathrm{arg\,max}\,}Q(s',a')=\underset{a'}{\mathrm{arg\,max}\,}\mathbb{E}[Z(s',a')].\end{align*}\] Note that we're still using \(Q(s,a)\), i.e. the expected return, to decide which action to pick, but we're trying to optimize distributions rather than expectations (of those distributions).

There’s a subtlety in expression \(\ref{eq:distrib_td_error}\): \(Z(s,a)\) is a random variable even when \(s\) and \(a\) are held fixed, and all the more so when \(s\) or \(a\) are themselves random variables!

2 Policy Evaluation

Let’s consider policy evaluation for a moment. In this case we want to minimize \[\begin{align*}\mathbb{E}_{s,a,s',a'}\left[\left(r(s,a)+\gamma Q(s',a')-Q(s,a)\right)^{2}\right].\end{align*}\] We can define the Bellman operator for evaluation as follows: \[\begin{align*}(\mathcal{T}^{\pi}Q)(s,a)=\mathbb{E}_{s'\sim p(\cdot|s,a),a'\sim\pi(\cdot|s')}[r(s,a)+\gamma Q(s',a')].\end{align*}\] The Bellman operator \(\mathcal{T}^{\pi}\) is a \(\gamma\)-contraction, meaning that \[\begin{align*}\mathrm{dist}\left(\mathcal{T}^{\pi}Q_{1},\mathcal{T}^{\pi}Q_{2}\right)\leq\gamma\,\mathrm{dist}\left(Q_{1},Q_{2}\right),\end{align*}\] so \(\mathcal{T}^{\pi}\) has a unique fixed point, which is \(Q^{\pi}\) (i.e. \(\mathcal{T}^{\pi}Q=Q\iff Q=Q^{\pi}\)), and repeated application converges to it: \((\mathcal{T}^{\pi})^{\infty}Q=Q^{\pi}\), disregarding approximation errors.
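As a quick sanity check, here is a tiny numerical sketch of this convergence on a made-up two-state, two-action MDP (all the quantities below are random placeholders, not taken from the text): repeatedly applying \(\mathcal{T}^{\pi}\) to an arbitrary \(Q\) settles on a fixed point.

```python
# A small sketch: iterate Q <- T^pi Q on a random MDP and a uniform policy.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probs
R = rng.normal(size=(n_s, n_a))                    # expected rewards r(s, a)
pi = np.full((n_s, n_a), 0.5)                      # uniform policy pi(a|s)

Q = np.zeros((n_s, n_a))
for _ in range(200):
    V_next = (pi * Q).sum(axis=1)                  # E_{a'~pi}[Q(s', a')]
    Q = R + gamma * P @ V_next                     # (T^pi Q)(s, a)
# Q now approximates Q^pi
```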

It turns out (M. G. Bellemare, Dabney, and Munos 2017) that this result can be ported to the distributional setting. Let’s define the Bellman distribution operator for evaluation in an analogous way: \[\begin{align*} (\mathcal{T}_{D}^{\pi}Z)(s,a) & =R(s,a)+\gamma Z(s',a')\\ s' & \sim p(\cdot|s,a)\\ a' & \sim\pi(\cdot|s') \end{align*}\] \(\mathcal{T}_{D}^{\pi}\) is a \(\gamma\)-contraction in the Wasserstein distance \(\mathcal{W}\), i.e. \[\begin{align*}\sup_{s,a}\mathcal{W}\left(\mathcal{T}_{D}^{\pi}Z_{1}(s,a),\mathcal{T}_{D}^{\pi}Z_{2}(s,a)\right)\leq\gamma\sup_{s,a}\mathcal{W}(Z_{1}(s,a),Z_{2}(s,a))\end{align*}\] This isn’t true for the KL divergence.
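For intuition, the 1-Wasserstein distance between two one-dimensional distributions equals the area between their CDFs, so for two pmfs living on the same uniformly spaced atoms it can be sketched in a couple of lines (this is a simplified illustration under that assumption, not the general definition):

```python
# Sketch: W1 between two pmfs p, q on the same grid with spacing dz.
import numpy as np

def wasserstein1(p, q, dz):
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dz
```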

Unfortunately, this result doesn’t hold for the control version of the distributional operator (the one with the \(\max\)).

3 KL divergence

3.1 Definition

Fair warning: this subsection is highly informal.

If \(p\) and \(q\) are two distributions with the same support (i.e. their pdfs are non-zero at the same points), then their KL divergence is defined as follows: \[\begin{align*}\mathrm{KL}(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}dx.\end{align*}\]

Let’s consider the discrete case: \[\begin{align*}\mathrm{KL}(p\|q)=\sum_{i=1}^{N}p(x_{i})\log\frac{p(x_{i})}{q(x_{i})}=\sum_{i=1}^{N}p(x_{i})[\log p(x_{i})-\log q(x_{i})].\end{align*}\] As we can see, we’re basically comparing the log-probabilities at the points \(x_{1},\ldots,x_{N}\), weighting each comparison according to \(p(x_{i})\). Note that the KL doesn’t make use of the values \(x_{i}\) directly: only their probabilities are used! Moreover, if \(p\) and \(q\) have different supports, the KL is undefined.
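Here is a quick numeric sketch of this formula; the two pmfs below are arbitrary examples sharing the same support:

```python
# Discrete KL divergence between two pmfs on the same support.
import numpy as np

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
print(kl(p, q), kl(q, p))   # note: not symmetric
```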

3.2 How to use it

Now say we’re using DQN and extract \((s,a,r,s')\) from the replay buffer. A “sample of the target distribution” is \(r+\gamma Z(s',a^{*})\), where \(a^{*}=\mathrm{arg\,max}_{a'}Q(s',a')\). We want to move \(Z(s,a)\) towards this target (by keeping the target fixed).

I called \(r+\gamma Z(s',a^{*})\) a “sample of the target distribution”, but it’s more accurate to say that it is a collection of samples from the real target distribution. The real target distribution we want to learn is \(r+\gamma Z(s',a^{*})\) where \(r\) and \(s'\) are random variables. Since \(r\) and \(s'\) have been sampled (they come from the replay buffer), the atoms of \(r+\gamma Z(s',a^{*})\) are just samples from the real target distribution, not a representation of that distribution in atom form. Indeed, to draw a single sample from the real target distribution we would first sample \(r\) and \(s'\) and then sample from the distribution \(r+\gamma Z(s',a^{*})\). That’s almost exactly what we’re doing: the only difference is that instead of drawing one sample from \(r+\gamma Z(s',a^{*})\), we use all of its atoms.

Let’s say we have a net which models \(Z\) by taking a state \(s\) and returning a distribution \(Z(s,a)\) for each action. For instance, we can represent each distribution through a softmax like we often do in Deep Learning for classification tasks. In particular, let’s choose some fixed values \(x_{1},\ldots,x_{N}\) for the support of all the distributions returned by the net. To simplify things, let’s make them equidistant so that \[\begin{align*}x_{i+1}-x_{i}=d=(x_{N}-x_{1})/(N-1),\qquad i=1,\ldots,N-1\end{align*}\] The pmf looks like a comb:

Figure 1.

Since the values \(x_{1},\ldots,x_{N}\) are fixed, we just have to return \(N\) probabilities for each \(Z(s,a)\), so the net takes a single state and returns \(|\mathcal{A}|N\) scalars, where \(|\mathcal{A}|\) is the number of possible actions.
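A minimal sketch of such a net in PyTorch might look as follows; the architecture (one hidden layer of 128 units) and all hyperparameters are placeholders, not the ones from the paper. It also shows how \(Q(s,a)=\sum_{i}p_{i}x_{i}\) is recovered from the output:

```python
# Sketch of a categorical head: one softmax over N fixed atoms per action.
import torch
import torch.nn as nn

class CategoricalQNet(nn.Module):
    def __init__(self, state_dim, n_actions, n_atoms, v_min, v_max):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        self.register_buffer("z", torch.linspace(v_min, v_max, n_atoms))  # x_1..x_N
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_actions * n_atoms))

    def forward(self, s):
        logits = self.body(s).view(-1, self.n_actions, self.n_atoms)
        probs = logits.softmax(dim=-1)          # one pmf per action
        q = (probs * self.z).sum(dim=-1)        # Q(s, a) = sum_i p_i x_i
        return probs, q
```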

If \(p_{1},\ldots,p_{N}\) and \(q_{1},\ldots,q_{N}\) are the probabilities of the two distributions \(p\) and \(q\), then their KL is simply \[\begin{align*}\mathrm{KL}(p\|q)=\sum_{i=1}^{N}p_{i}\log\frac{p_{i}}{q_{i}}=H(p,q)-H(p)\end{align*}\] and if you’re optimizing wrt \(q\) (i.e. you’re moving \(q\) towards \(p\)), then you can drop the entropy term.

Also, we can recover \(Q(s,a)\) very easily: \[\begin{align*}Q(s,a)=\mathbb{E}[Z(s,a)]=\sum_{i=1}^{N}p_{i}x_{i}.\end{align*}\]

The interesting part is the transformation. In distributional Q-learning we want to move \(Z(s,a)\) towards \(r+\gamma Z(s',a^{*})\), but how do we put the target in “standard comb form”? This is the projection step described in (M. G. Bellemare, Dabney, and Munos 2017) and it’s very easy. To form the target distribution we start from \(p=Z(s',a^{*})\), which is already in the standard form \(p_{1},\ldots,p_{N}\), and we look at the pairs \((x_{1},p_{1}),\ldots,(x_{N},p_{N})\) as if they were weighted samples, which the authors of (M. G. Bellemare, Dabney, and Munos 2017) call atoms. This means that we can transform the distribution \(p\) just by transforming the positions of its atoms. The transformed atoms corresponding to \(r+\gamma Z(s',a^{*})\) are \[\begin{align*}(r+\gamma x_{1},p_{1}),(r+\gamma x_{2},p_{2}),\ldots,(r+\gamma x_{N},p_{N}).\end{align*}\] Note that the weights \(p_{i}\) don’t change. The problem is that now we have atoms which aren’t in the standard positions \(x_{1},\ldots,x_{N}\). The solution proposed in (M. G. Bellemare, Dabney, and Munos 2017) is to split each misaligned atom onto the two closest aligned atoms, distributing its weight according to its distance from each of them:

Figure 2.

Observe the proportions very carefully. Let’s say the green atom has weight \(w\). For some constant \(c\), the green atom is at distance \(3c\) from \(x_{6}\) and \(c\) from \(x_{7}\). Accordingly, the atom at \(x_{6}\) receives weight \(\frac{1}{4}w\) and the atom at \(x_{7}\) weight \(\frac{3}{4}w\): each aligned atom receives a share proportional to how close the misaligned atom is to it. Also, note that the probability mass is conserved, so there’s no need to normalize after the splitting. Of course, since we need to split all the transformed atoms, an individual aligned atom can receive contributions from several of them; we simply sum all the contributions. This is how the authors do it, but it’s certainly not the only way.
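Here is a small sketch of this splitting rule for a single misaligned atom; the grid below is made up, and the example reproduces the \(\frac{1}{4}w\) / \(\frac{3}{4}w\) split from the figure:

```python
# Split one misaligned atom's weight onto the two surrounding aligned atoms,
# in proportion to its proximity to each.
import numpy as np

z = np.linspace(-10.0, 10.0, 11)      # aligned positions x_1..x_N
dz = z[1] - z[0]
m = np.zeros_like(z)                  # accumulated aligned weights

def split_atom(pos, w, z, dz, m):
    pos = np.clip(pos, z[0], z[-1])
    b = (pos - z[0]) / dz             # fractional index of the misaligned atom
    l, u = int(np.floor(b)), int(np.ceil(b))
    if l == u:                        # already aligned: keep all the mass there
        m[l] += w
    else:
        m[l] += w * (u - b)           # the closer aligned atom gets the larger share
        m[u] += w * (b - l)
    return m

split_atom(pos=z[5] + 0.75 * dz, w=1.0, z=z, dz=dz, m=m)
print(m[5], m[6])                     # 0.25 and 0.75, as in the figure
```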

3.3 The full algorithm

Here’s the algorithm taken directly (cut & pasted) from (M. G. Bellemare, Dabney, and Munos 2017):

Figure 3.

Assume we’ve just picked \((x_{t},a_{t},r_{t},x_{t+1})\) from the replay buffer in some variant of the DQN algorithm, so \(x\) is used to indicate states. The \(z_{0},\ldots,z_{N-1}\) are the fixed global positions of the atoms (i.e. our \(x_{1},\ldots,x_{N}\) in figure 2). Let’s assume there’s just a global \(\gamma\).

Let’s go through the algorithm in detail assuming we’re using a neural network for \(Z\):

We feed \(x_{t+1}\) to our net which outputs an \(|\mathcal{A}|\times N\) matrix \(M(x_{t+1})\), i.e. each row corresponds to a single action and contains the probabilities for the \(N\) atoms. That is, the row for action \(a\) contains the vector \[\begin{align*}(p_{0}(x_{t+1},a),\ldots,p_{N-1}(x_{t+1},a))\end{align*}\]

We compute all the \[\begin{align*}Q(x_{t+1},a)=\mathbb{E}\left[Z(x_{t+1},a)\right]=\sum_{i=0}^{N-1}z_{i}p_{i}(x_{t+1},a)\end{align*}\] as follows: \[\begin{align*}Q(x_{t+1})=M(x_{t+1})\begin{bmatrix}z_{0}\\ z_{1}\\ \vdots\\ z_{N-1} \end{bmatrix}.\end{align*}\] Note that \(Q(x_{t+1})\) is a column vector of length \(|\mathcal{A}|\).

Now we can determine the optimum action \[\begin{align*}a^{*}=\mathrm{arg\,max_{a}}Q(x_{t+1},a)\end{align*}\] Let \(q=(q_{0},\ldots,q_{N-1})\) be the row of \(M(x_{t+1})\) corresponding to \(a^{*}\).
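A small numeric sketch of these three steps (the matrix \(M\), the atom positions and all values below are made-up placeholders): the expected returns are a matrix-vector product, the greedy action is their argmax, and \(q\) is the corresponding row of probabilities.

```python
import numpy as np

M = np.array([[0.2, 0.5, 0.3],      # pmf of Z(x_{t+1}, a=0) over the atoms
              [0.1, 0.1, 0.8]])     # pmf of Z(x_{t+1}, a=1)
z = np.array([-1.0, 0.0, 1.0])      # fixed atom positions z_0..z_{N-1}

Q = M @ z                           # Q(x_{t+1}, a) = sum_i z_i p_i(x_{t+1}, a)
a_star = np.argmax(Q)
q = M[a_star]                       # probabilities of Z(x_{t+1}, a*)
```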

\(m_{0},\ldots,m_{N-1}\) will accumulate the probabilities of the aligned atoms of the target distribution \(r_{t}+\gamma Z(x_{t+1},a^{*})\). We start by zeroing them.

The non-aligned atoms of the target distribution are at positions \[\begin{align*}\hat{\mathcal{T}}z_{j}=r_{t}+\gamma z_{j},\qquad j=0,\ldots,N-1\end{align*}\] We clip those positions so that they are in \([V_{\mathrm{MIN}},V_{\mathrm{MAX}}]\), i.e. \([z_{0},z_{N-1}]\).

Assuming that the adjacent aligned atoms are at distance \(\Delta z\), the indices of the closest aligned atoms on the left and on the right of \(\hat{\mathcal{T}}z_{j}\) are, respectively: \[\begin{align*} l & =\left\lfloor \frac{\hat{\mathcal{T}}z_{j}-z_{0}}{\Delta z}\right\rfloor \\ u & =\left\lceil \frac{\hat{\mathcal{T}}z_{j}-z_{0}}{\Delta z}\right\rceil \end{align*}\]

Now we need to split the weight of \(\hat{\mathcal{T}}z_{j}\), which is \(q_{j}\), between \(m_{l}\) and \(m_{u}\) as we saw before. Writing \(b_{j}=\frac{\hat{\mathcal{T}}z_{j}-z_{0}}{\Delta z}\), note that \[\begin{align*} u-b_{j} & =\frac{z_{u}-z_{0}}{\Delta z}-\frac{\hat{\mathcal{T}}z_{j}-z_{0}}{\Delta z}=\frac{z_{u}-\hat{\mathcal{T}}z_{j}}{z_{u}-z_{l}}\\ b_{j}-l & =\frac{\hat{\mathcal{T}}z_{j}-z_{0}}{\Delta z}-\frac{z_{l}-z_{0}}{\Delta z}=\frac{\hat{\mathcal{T}}z_{j}-z_{l}}{z_{u}-z_{l}} \end{align*}\] which means that, as we said before, the weight \(q_{j}\) is split between \(z_{l}\) and \(z_{u}\) (indeed, \((u-b_{j})+(b_{j}-l)=1\)), and the contribution to \(m_{l}\) is proportional to the distance of \(\hat{\mathcal{T}}z_{j}\) from \(z_{u}\): the more distant it is from \(z_{u}\), the higher the contribution to \(m_{l}\).

Now we have the probabilities \(m_{0},\ldots,m_{N-1}\) of the aligned atoms of \(r_{t}+\gamma Z(x_{t+1},a^{*})\) and, of course, the probabilities \[\begin{align*}p_{0}(x_{t},a_{t};\theta),\ldots,p_{N-1}(x_{t},a_{t};\theta)\end{align*}\] of the aligned atoms of \(Z(x_{t},a_{t})\), which are the ones we want to update. Thus \[\begin{align*} \nabla_{\theta}\mathrm{KL}(m\|p_{\theta}) & =\nabla_{\theta}\sum_{i=0}^{N-1}m_{i}\log\frac{m_{i}}{p_{i}(x_{t},a_{t};\theta)}\\ & =\nabla_{\theta}\left[H(m,p_{\theta})-H(m)\right]\\ & =\nabla_{\theta}H(m,p_{\theta}). \end{align*}\] That is, we can just use the cross-entropy \[\begin{align*}H(m,p_{\theta})=-\sum_{i=0}^{N-1}m_{i}\log p_{i}(x_{t},a_{t};\theta)\end{align*}\] as the loss.
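Putting the steps together, here is a sketch of the target construction and loss for a single transition; `q` and `p` stand for the pmfs produced by the target and online networks respectively, and are just uniform placeholders here.

```python
# Sketch of the categorical target projection and cross-entropy loss for one
# transition, following the steps above.
import numpy as np

def categorical_target(q, r, gamma, z):
    N, dz = len(z), z[1] - z[0]
    m = np.zeros(N)
    Tz = np.clip(r + gamma * z, z[0], z[-1])      # transformed, then clipped, atoms
    b = (Tz - z[0]) / dz
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)
    for j in range(N):                            # distribute q_j onto m_l and m_u
        if l[j] == u[j]:
            m[l[j]] += q[j]
        else:
            m[l[j]] += q[j] * (u[j] - b[j])
            m[u[j]] += q[j] * (b[j] - l[j])
    return m

z = np.linspace(-10.0, 10.0, 51)
q = np.full(51, 1 / 51)                           # pmf of Z(x_{t+1}, a*)  (target net)
p = np.full(51, 1 / 51)                           # pmf of Z(x_t, a_t)     (online net)
m = categorical_target(q, r=1.0, gamma=0.99, z=z)
loss = -np.sum(m * np.log(p + 1e-8))              # cross-entropy H(m, p_theta)
```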

4 Wasserstein distance

The first paper (M. G. Bellemare, Dabney, and Munos 2017) about distributional RL left a gap between theory and practice because the theory requires the Wasserstein distance, but in practice they used a KL-based procedure.

The second paper (Dabney et al. 2017) closes the gap in a very elegant way.

4.1 A different idea

This time I won’t start with a definition, but with an idea. Rather than using atoms with fixed positions and variable weights, let’s do the opposite: atoms with fixed weights and variable positions. Moreover, let’s give every atom the same weight, i.e. \(1/N\) if there are \(N\) atoms.

But how do we represent distributions this way? It’s very simple, really. We slice up the distribution we want to represent into \(N\) slices of \(1/N\) mass and put each atom at the median of its slice. This makes sense: each slice carries \(1/N\) of the probability mass, and so does the atom that represents it:

Figure 4. The distribution has been sliced up into slices of equal probability mass and red points have been placed in the center of mass of each slice. For the code, see subsection 4.2.4. This image is produced through sampling as described in the following sections.
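Here is a rough sketch of how such a picture could be produced by sampling (this is not the code referenced in the caption; the standard normal is just an example distribution): draw many samples, cut them into \(N\) slices of equal probability mass, and place one atom per slice.

```python
# Sketch: N equal-mass slices of a sampled distribution, one atom per slice.
import numpy as np

N = 8
samples = np.sort(np.random.default_rng(0).normal(size=100_000))
slices = np.array_split(samples, N)               # N slices of ~1/N mass each
atoms = np.array([np.median(s) for s in slices])  # one atom per slice, weight 1/N
```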

If the atoms are \(N\) then the \(i\)-th atom corresponds to a quantile of \[\begin{align*}\hat{\tau}_{i}=\frac{2(i-1)+1}{2N},\qquad i=1,\ldots,N\end{align*}\] as shown in the following picture:

Figure 5.

4.2 How do we determine a quantile?

4.2.1 Determining the median

The median is just the \(0.5\) quantile, i.e. a point which has \(0.5\) mass on the left and \(0.5\) mass on the right. In other words, it splits the probability mass in half. So let’s say we have a random variable \(X\) and we know how to draw samples. How can we compute the median? We start with a guess \(\theta\), draw some samples and if \(\theta\) has more samples on the left than on the right, we move it a little to the left. By symmetry, we move it to the right if it has more samples on the right. Then we repeat the process and keep updating \(\theta\) until convergence.

We should move \(\theta\) in proportion to the disparity between the two sides, so let’s decide that each sample on the left subtracts \(\alpha\) from \(\theta\) and each sample on the right adds \(\alpha\) to it. Basically, \(\alpha\) is a learning rate: if it’s too small the algorithm takes too long, and if it’s too big the algorithm fluctuates a lot around the optimal solution. Here’s a picture of this method:

Figure 6.

We reach the equilibrium when \(\theta\) is the median. Doesn’t this look like SGD with a minibatch of \(16\) samples and learning rate \(\alpha\)? What’s the corresponding loss? The loss is clearly \[\begin{align}L_{\theta}=\mathbb{E}_{X}[|X-\theta|]\label{eq:median_loss}\end{align}\] This should look familiar to any statistician. Note that in the picture above we’re adding the gradients, but when we minimize we subtract them, so the gradient for samples on the left of \(\theta\) must be \(1\) and for those on the right \(-1\): \[\begin{align*}\nabla_{\theta}L_{\theta}=\mathbb{E}_{X}\left[\nabla_{\theta}|X-\theta|\right],\qquad\nabla_{\theta}|X-\theta|=\begin{cases} \nabla_{\theta}(\theta-X)=1 & \text{if }X<\theta\\ \nabla_{\theta}(X-\theta)=-1 & \text{if }X>\theta \end{cases}\end{align*}\]
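Here is a tiny simulation of this procedure (the minibatch size of \(16\) matches the picture; the exponential distribution and the learning rate are arbitrary choices): \(\theta\) settles near the true median.

```python
# Sketch: theta moves by +alpha per sample on its right and -alpha per sample
# on its left, and ends up hovering around the median.
import numpy as np

rng = np.random.default_rng(0)
theta, alpha = 0.0, 0.01
for _ in range(5_000):
    x = rng.exponential(scale=1.0, size=16)       # skewed, so median != mean
    theta += alpha * np.sum(np.sign(x - theta))
print(theta, np.log(2.0))                         # true median of Exp(1) is ln 2
```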


